Puno Department
Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages
Omnilingual ASR team, null, Keren, Gil, Kozhevnikov, Artyom, Meng, Yen, Ropers, Christophe, Setzler, Matthew, Wang, Skyler, Adebara, Ife, Auli, Michael, Balioglu, Can, Chan, Kevin, Cheng, Chierh, Chuang, Joe, Droof, Caley, Duppenthaler, Mark, Duquenne, Paul-Ambroise, Erben, Alexander, Gao, Cynthia, Gonzalez, Gabriel Mejia, Lyu, Kehan, Miglani, Sagar, Pratap, Vineel, Sadagopan, Kaushik Ram, Saleem, Safiyyah, Turkatenko, Arina, Ventayol-Boada, Albert, Yong, Zheng-Xin, Chung, Yu-An, Maillard, Jean, Moritz, Rashel, Mourachko, Alexandre, Williamson, Mary, Yates, Shireen
Automatic speech recognition (ASR) has advanced in high-resource languages, but most of the world's 7,000+ languages remain unsupported, leaving thousands of long-tail languages behind. Expanding ASR coverage has been costly and limited by architectures that restrict language support, making extension inaccessible to most--all while entangled with ethical concerns when pursued without community collaboration. To transcend these limitations, we introduce Omnilingual ASR, the first large-scale ASR system designed for extensibility. Omnilingual ASR enables communities to introduce unserved languages with only a handful of data samples. It scales self-supervised pre-training to 7B parameters to learn robust speech representations and introduces an encoder-decoder architecture designed for zero-shot generalization, leveraging a LLM-inspired decoder. This capability is grounded in a massive and diverse training corpus; by combining breadth of coverage with linguistic variety, the model learns representations robust enough to adapt to unseen languages. Incorporating public resources with community-sourced recordings gathered through compensated local partnerships, Omnilingual ASR expands coverage to over 1,600 languages, the largest such effort to date--including over 500 never before served by ASR. Automatic evaluations show substantial gains over prior systems, especially in low-resource conditions, and strong generalization. We release Omnilingual ASR as a family of models, from 300M variants for low-power devices to 7B for maximum accuracy. We reflect on the ethical considerations shaping this design and conclude by discussing its societal impact. In particular, we highlight how open-sourcing models and tools can lower barriers for researchers and communities, inviting new forms of participation. Open-source artifacts are available at https://github.com/facebookresearch/omnilingual-asr.
- North America > Canada > Alberta (0.14)
- Europe > Austria > Vienna (0.14)
- Africa > Sudan (0.14)
- (53 more...)
- Health & Medicine (1.00)
- Education (0.67)
- Information Technology (0.67)
Quechua Speech Datasets in Common Voice: The Case of Puno Quechua
Huaman, Elwin, Huaman, Wendi, Huaman, Jorge Luis, Quispe, Ninfa
Under-resourced languages, such as Quechuas, face data and resource scarcity, hindering their development in speech technology. To address this issue, Common Voice presents a crucial opportunity to foster an open and community-driven speech dataset creation. This paper examines the integration of Quechua languages into Common Voice. We detail the current 17 Quechua languages, presenting Puno Quechua (ISO 639-3: qxp) as a focused case study that includes language onboarding and corpus collection of both reading and spontaneous speech data. Our results demonstrate that Common Voice now hosts 191.1 hours of Quechua speech (86\% validated), with Puno Quechua contributing 12 hours (77\% validated), highlighting the Common Voice's potential. We further propose a research agenda addressing technical challenges, alongside ethical considerations for community engagement and indigenous data sovereignty. Our work contributes towards inclusive voice technology and digital empowerment of under-resourced language communities.
- South America > Peru > Puno Department > Puno Province > Puno (0.91)
- South America > Peru > Lima Department > Lima Province > Lima (0.05)
- South America > Ecuador (0.04)
- (6 more...)
Optimization of Energy Consumption Forecasting in Puno using Parallel Computing and ARIMA Models: An Innovative Approach to Big Data Processing
Vilca-Tinta, Cliver W., Torres-Cruz, Fred, Quispe-Morales, Josefh J.
This research presents an innovative use of parallel computing with the ARIMA (AutoRegressive Integrated Moving Average) model to forecast energy consumption in Peru's Puno region. The study conducts a thorough and multifaceted analysis, focusing on the execution speed, prediction accuracy, and scalability of both sequential and parallel implementations. A significant emphasis is placed on efficiently managing large datasets. The findings demonstrate notable improvements in computational efficiency and data processing capabilities through the parallel approach, all while maintaining the accuracy and integrity of predictions. This new method provides a versatile and reliable solution for real-time predictive analysis and enhances energy resource management, which is particularly crucial for developing areas. In addition to highlighting the technical advantages of parallel computing in this field, the study explores its practical impacts on energy planning and sustainable development in regions like Puno.
- South America > Peru > Puno Department > Puno Province > Puno (0.86)
- North America > Trinidad and Tobago > Trinidad > Arima > Arima (0.64)
- South America > Argentina (0.04)
- (13 more...)
- Research Report > New Finding (0.48)
- Research Report > Promising Solution (0.40)
- Overview > Innovation (0.40)
- Information Technology (1.00)
- Energy > Power Industry (1.00)
- Energy > Renewable > Hydroelectric (0.46)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Architecture (1.00)
- Information Technology > Data Science > Data Mining > Big Data (0.51)
Shortcomings of LLMs for Low-Resource Translation: Retrieval and Understanding are Both the Problem
This work investigates the in-context learning abilities of pretrained large language models (LLMs) when instructed to translate text from a low-resource language into a high-resource language as part of an automated machine translation pipeline. We conduct a set of experiments translating Southern Quechua to Spanish and examine the informativity of various types of information retrieved from a constrained database of digitized pedagogical materials (dictionaries and grammar lessons) and parallel corpora. Using both automatic and human evaluation of model output, we conduct ablation studies that manipulate (1) context type (morpheme translations, grammar descriptions, and corpus examples), (2) retrieval methods (automated vs. manual), and (3) model type. Our results suggest that even relatively small LLMs are capable of utilizing prompt context for zero-shot low-resource translation when provided a minimally sufficient amount of relevant linguistic information. However, the variable effects of prompt type, retrieval method, model type, and language-specific factors highlight the limitations of using even the best LLMs as translation systems for the majority of the world's 7,000+ languages and their speakers.
- North America > Canada > Ontario > Toronto (0.05)
- North America > United States > Ohio (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- (13 more...)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Killkan: The Automatic Speech Recognition Dataset for Kichwa with Morphosyntactic Information
Taguchi, Chihiro, Saransig, Jefferson, Velásquez, Dayana, Chiang, David
This paper presents Killkan, the first dataset for automatic speech recognition (ASR) in the Kichwa language, an indigenous language of Ecuador. Kichwa is an extremely low-resource endangered language, and there have been no resources before Killkan for Kichwa to be incorporated in applications of natural language processing. The dataset contains approximately 4 hours of audio with transcription, translation into Spanish, and morphosyntactic annotation in the format of Universal Dependencies. The audio data was retrieved from a publicly available radio program in Kichwa. This paper also provides corpus-linguistic analyses of the dataset with a special focus on the agglutinative morphology of Kichwa and frequent code-switching with Spanish. The experiments show that the dataset makes it possible to develop the first ASR system for Kichwa with reliable quality despite its small dataset size. This dataset, the ASR model, and the code used to develop them will be publicly available.
- South America > Bolivia (0.04)
- South America > Peru > Puno Department > Puno Province > Puno (0.04)
- South America > Ecuador > Pichincha Province > Quito (0.04)
- (8 more...)
ELSA -- Enhanced latent spaces for improved collider simulations
Nachman, Benjamin, Winterhalder, Ramon
Simulations play a key role for inference in collider physics. We explore various approaches for enhancing the precision of simulations using machine learning, including interventions at the end of the simulation chain (reweighting), at the beginning of the simulation chain (pre-processing), and connections between the end and beginning (latent space refinement). To clearly illustrate our approaches, we use W+jets matrix element surrogate simulations based on normalizing flows as a prototypical example. First, weights in the data space are derived using machine learning classifiers. Then, we pull back the data-space weights to the latent space to produce unweighted examples and employ the Latent Space Refinement (LASER) protocol using Hamiltonian Monte Carlo. An alternative approach is an augmented normalizing flow, which allows for different dimensions in the latent and target spaces. These methods are studied for various pre-processing strategies, including a new and general method for massive particles at hadron colliders that is a tweak on the widely-used RAMBO-on-diet mapping. We find that modified simulations can achieve sub-percent precision across a wide range of phase space.
- North America > United States > California > Alameda County > Berkeley (0.14)
- North America > United States > New York (0.04)
- South America > Peru > Puno Department (0.04)
- (4 more...)
Evaluating Self-Supervised Speech Representations for Indigenous American Languages
Chen, Chih-Chen, Chen, William, Zevallos, Rodolfo, Ortega, John E.
The application of self-supervision to speech representation learning has garnered significant interest in recent years, due to its scalability to large amounts of unlabeled data. However, much progress, both in terms of pre-training and downstream evaluation, has remained concentrated in monolingual models that only consider English. Few models consider other languages, and even fewer consider indigenous ones. In our submission to the New Language Track of the ASRU 2023 ML-SUPERB Challenge, we present an ASR corpus for Quechua, an indigenous South American Language. We benchmark the efficacy of large SSL models on Quechua, along with 6 other indigenous languages such as Guarani and Bribri, on low-resource ASR. Our results show surprisingly strong performance by state-of-the-art SSL models, showing the potential generalizability of large-scale models to real-world data.
- South America > Brazil (0.05)
- North America > Canada > Ontario > Toronto (0.05)
- South America > Bolivia (0.05)
- (18 more...)
An Integrated NPL Approach to Sentiment Analysis in Satisfaction Surveys
The research project aims to apply an integrated approach to natural language processing NLP to satisfaction surveys. It will focus on understanding and extracting relevant information from survey responses, analyzing feelings, and identifying recurring word patterns. NLP techniques will be used to determine emotional polarity, classify responses into positive, negative, or neutral categories, and use opinion mining to highlight participants opinions. This approach will help identify the most relevant aspects for participants and understand their opinions in relation to those specific aspects. A key component of the research project will be the analysis of word patterns in satisfaction survey responses using NPL. This analysis will provide a deeper understanding of feelings, opinions, and themes and trends present in respondents responses. The results obtained from this approach can be used to identify areas for improvement, understand respondents preferences, and make strategic decisions based on analysis to improve respondent satisfaction.
- Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Comparative Analysis of Libraries for the Sentimental Analysis
This study is main goal is to provide a comparative comparison of libraries using machine learning methods. Experts in natural language processing (NLP) are becoming more and more interested in sentiment analysis (SA) of text changes. The objective of employing NLP text analysis techniques is to recognize and categorize feelings related to twitter users utterances. In this examination, issues with SA and the libraries utilized are also looked at. provides a number of cooperative methods to classify emotional polarity. The Naive Bayes Classifier, Decision Tree Classifier, Maxent Classifier, Sklearn Classifier, Sklearn Classifier MultinomialNB, and other conjoint learning algorithms, according to recent research, are very effective. In the project will use Five Python and R libraries NLTK, TextBlob, Vader, Transformers (GPT and BERT pretrained), and Tidytext will be used in the study to apply sentiment analysis techniques. Four machine learning models Tree of Decisions (DT), Support Vector Machine (SVM), Naive Bayes (NB), and K-Nearest Neighbor (KNN) will also be used. To evaluate how well libraries for SA operate in the social network environment, comparative study was also carried out. The measures to assess the best algorithms in this experiment, which used a single data set for each method, were precision, recall, and F1 score. We conclude that the BERT transformer method with an Accuracy: 0.973 is recommended for sentiment analysis.
- South America > Peru > Puno Department > Puno Province > Puno (0.05)
- Asia > India (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Information Extraction (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
QICHWABASE: A Quechua Language and Knowledge Base for Quechua Communities
Huaman, Elwin, Lindemann, David, Caruso, Valeria, Huaman, Jorge Luis
Over the last decade, the Web has increasingly become a space of language and knowledge representation. However, it is only true for well-spread languages and well-established communities, while minority communities and their resources received less attention. In this paper, we propose QICHWABASE to support the harmonization process of the Quechua language and knowledge, and its community. For doing it, we adopt methods and tools that could become a game changer in favour of Quechua communities around the world. We conclude that the methodology and tools adopted on building QICHWABASE, which is a Wikibase instance, could enhance the presence of minorities on the Web.
- South America > Peru > Puno Department > Puno Province > Puno (0.05)
- Europe > Spain > Basque Country (0.05)
- Europe > Italy (0.05)
- (3 more...)